df_all = read.csv("/Users/PeterNovak/Desktop/Bayesian AB Testing/test_data_ba_1.csv")
df_all$user_id = as.character(df_all$user_id)
df_all$test_group = as.character(df_all$test_group)
df_all$is_payer = ifelse(df_all$total_spend == 0, 0, 1)
head(df_all)
df_payers = df_all[df_all$is_payer == 1, ]
head(df_payers)
One reason for this problem is the presence of duplicate values in the distribution; it is even worse in winsorised spend, since the duplicates there represent a higher proportion of total player spend. The duplicates matter when we visualise the data against a hypothesised continuous distribution: a continuous distribution assigns probability 0 to any exact value, so the probability of the same value appearing twice in a finite sample is 0 as well.
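As a quick check of this, one can measure the proportion of exactly tied values in a sample; under any continuous model that proportion should be essentially zero. A minimal base-R sketch (the spend values below are hypothetical, not client data):

```r
# Proportion of duplicated (exactly tied) values in a numeric sample.
# A continuous distribution implies this should be ~0, so a large value
# flags the duplicate-spend problem described above.
tie_proportion = function(x) {
  mean(duplicated(x))
}

# Hypothetical spend sample in which "clean" price points repeat:
x = c(0.99, 0.99, 4.99, 9.99, 9.99, 9.99, 23.50)
tie_proportion(x)  # 3 of the 7 values are exact repeats: 3/7
```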
So how well do common distributions fit? Of those tried, the data are closest to the Tweedie:
distributions = c(rlnorm, rexp, rtweedie)
distribution_name = c("lnorm", "exp", "tweedie")
dist_obj = get_dist(df_payers$total_wins_spend, distributions, distribution_name,
c("plum", "dodgerblue", "darkgreen"), T, F)
## Warning in fitdist(data = input_list, method = "mle", distr = "tweedie", :
## The dtweedie function should return a vector with NaN values when input has
## inconsistent parameters and not raise an error
## Warning in fitdist(data = input_list, method = "mle", distr = "tweedie", : The
## ptweedie function should return a zero-length vector when input has length zero
## and not raise an error
dist_obj[1:2]
## $param_est
## power mu phi
## 3.0890528 23.0561183 0.2190246
##
## $model_used
## [1] "tweedie"
On all criteria it also outperforms the log-normal distribution with optimised parameters; the p-value of the KS test is 0.07.
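For reference, that kind of KS comparison can be sketched in base R with `stats::ks.test`; the data and parameters below are simulated, not the client data. Two caveats: ties in real spend data make the KS p-value approximate, and estimating the parameters from the same sample biases the test toward acceptance.

```r
# Hedged sketch: KS test of a sample against a log-normal whose
# parameters are estimated from that same (simulated) sample.
set.seed(1)
x = rlnorm(500, meanlog = 2, sdlog = 1)

# Estimate log-normal parameters from the log-scale sample moments.
fit_meanlog = mean(log(x))
fit_sdlog   = sd(log(x))

ks = ks.test(x, "plnorm", meanlog = fit_meanlog, sdlog = fit_sdlog)
ks$p.value  # a large p-value means no evidence against the fit
```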
Conclusion:
The Tweedie distribution is a better fit for our data; however, the presence of duplicate spend values makes any unimodal distribution a likely poor fit. A truly well-fitting distribution would have to be highly irregular, because of how the data are generated:
a. A single transaction is likely to be a "clean" value in the local currency (e.g. $0.99, $1.99, $4.99, $9.99, $49.99, $99.99).
b. The local currency is converted to USD, and those converted values are likely to be duplicated as well.
c. Steps a and b repeat whenever the same player makes a subsequent transaction, which adds to the total spend per user.
d. All spends per user are summed to produce a per-user total spend.
e. Daily spends are winsorised: any daily spend above a certain threshold is set to the threshold, which is likely to create duplicate total spends whenever a user spends on only one day and exceeds the limit.
This process leads to duplicate spend values.
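The winsorisation step above can be sketched directly; the cap value here is hypothetical:

```r
# Winsorise daily spends: cap anything above the threshold at the
# threshold. Heavy one-day spenders then collapse onto identical totals.
winsorise = function(x, cap) {
  pmin(x, cap)
}

daily = c(3.98, 149.00, 520.00, 9.99, 249.99)
winsorise(daily, cap = 99.99)
# all three spends above 99.99 become exactly 99.99 -> duplicate values
```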
One caveat: is assuming the data follows an exponential distribution even wrong?
"One worry might be that selecting a model after considering the data is HARKing (hypothesizing after the results are known; Kerr 1998). In that famous article, Kerr discusses why HARKing may be inadvisable. In particular, HARKing can transform Type I errors (false alarms) into confirmed hypothesis ergo fact. I.e. the non-linear trend in the population data might be a random fluke, and the better fit by the non-linear distribution might be a random fluke.
Bayesian analysis is always conditional on the assumed model space. Often the assumed model space is merely a convenient default. The default is convenient because it is familiar to both the analyst and the audience of the analysis, but the default need not be a theoretical commitment. There are also different goals for data analysis: Describing the one set of data in hand, and generalizing to the population from which the data were sampled. Various methods for penalizing overfitting of noise are aimed at finding a statistical compromise between describing the data in hand and generalizing to other data."
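One common statistical compromise of that kind is an information criterion such as AIC, which penalises each extra fitted parameter. A hedged base-R sketch on simulated data (not the client data):

```r
# Compare log-normal vs exponential fits on the same sample via AIC.
set.seed(2)
x = rlnorm(300, meanlog = 1, sdlog = 0.8)

# Log-likelihood of each candidate at plug-in parameter estimates.
ll_lnorm = sum(dlnorm(x, meanlog = mean(log(x)), sdlog = sd(log(x)), log = TRUE))
ll_exp   = sum(dexp(x, rate = 1 / mean(x), log = TRUE))

# AIC = 2k - 2 * logLik (k = number of fitted parameters); lower wins.
aic_lnorm = 2 * 2 - 2 * ll_lnorm
aic_exp   = 2 * 1 - 2 * ll_exp
aic_lnorm < aic_exp  # the penalty still favours the better-fitting model
```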
Further:
"Indeed, according to Kleijn, v.d Vaart (2012), in the misspecified case, the posterior distribution:
converges as \(n \to \inf\) to a Dirac distribution cantered at a \(\hat{\theta}_{ML}\)
does not have the correct variance (unless two values just happen to be same) in order to ensure that credible intervals of the posterior match confidence intervals for \(\theta\)."
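That concentration behaviour can be illustrated with a conjugate toy model; everything here is hypothetical (exponential data analysed under a normal-mean model with known variance 1 and an N(0, 100) prior):

```r
# Misspecified conjugate example: exponential data, normal-mean model.
# The posterior still concentrates, but at the KL-minimising parameter
# (here the sample mean), as in Kleijn and van der Vaart (2012).
set.seed(3)
x = rexp(5000, rate = 0.5)   # true data-generating process, mean 2
n = length(x)

prior_mean = 0
prior_var  = 100
post_var   = 1 / (1 / prior_var + n)                     # known var = 1
post_mean  = post_var * (prior_mean / prior_var + sum(x))

c(post_mean, mean(x))  # posterior mean is essentially the MLE
```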
See the following papers on updating prior distributions, alongside prior parameters, when presented with new data:
https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssb.12158
https://xianblog.wordpress.com/2013/07/15/a-general-framework-for-updating-belief-functions/
Next steps:
Investigate the best-fitting distribution for each client data set.
Create a "population" of spenders by analysing every data set we have so far on player spend.
# Load a client CSV and return only the paying users.
load_df = function(file_path) {
  df_all = read.csv(file_path)
  df_all$user_id = as.character(df_all$user_id)
  df_all$test_group = as.character(df_all$test_group)
  df_all$is_payer = ifelse(df_all$total_spend == 0, 0, 1)
  df_payers = df_all[df_all$is_payer == 1, ]
  return(df_payers)
}
# Plot histograms of winsorised payer spend on the raw and log scales,
# with kernel-density overlays and the data set name in the legend.
visualise_df = function(df, name) {
  name = gsub(" ", "\n", name)
  n_breaks = max(min(nrow(df) / 200, 40), 20)
  temp_dens = density(df$total_wins_spend, adjust = 2)
  temp_dens_log = density(log(df$total_wins_spend), adjust = 2)
  hist(df$total_wins_spend,
       col = "floralwhite",
       breaks = n_breaks,
       prob = TRUE,
       xlab = "Payer Winz. Spend",
       main = "Hist.: Payer Winz Spend")
  lines(temp_dens, col = "firebrick4")
  legend("topright", legend = name, bty = "n")
  hist(log(df$total_wins_spend),
       col = "floralwhite",
       breaks = n_breaks,
       prob = TRUE,
       xlab = "Payer log(Winz. Spend)",
       main = "Hist.: Payer ln(Winz Spend)")
  lines(temp_dens_log, col = "firebrick4")
  legend("topright", legend = name, bty = "n")
}
# Flatten the summary statistics of a fitdistrplus::descdist object
# into a plain numeric vector (min, max, median, mean, sd, skew, kurt).
descdist_to_list = function(descdist_obj) {
  c(descdist_obj$min,
    descdist_obj$max,
    descdist_obj$median,
    descdist_obj$mean,
    descdist_obj$sd,
    descdist_obj$skewness,
    descdist_obj$kurtosis)
}
df_bingo_aloha = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_bingo_aloha.csv")
df_homw = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_homw.csv")
df_idle_mafia = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_idle_mafia.csv")
df_spongebob = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_spongebob.csv")
df_terra_genesis = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_terra_genesis.csv")
df_ultimex = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_ultimex.csv")
par(mar = c(4, 3, 3, 3), mgp = c(2, 0.5, 0))
graphics::layout(mat = matrix(1:6, nrow = 2, ncol = 3, byrow = FALSE),
                 heights = rep(1, 2), widths = rep(1, 3))
visualise_df(df_bingo_aloha, "Bingo Aloha")
visualise_df(df_homw, "HOMW")
visualise_df(df_idle_mafia, "Idle Mafia")
visualise_df(df_spongebob, "Spongebob")
visualise_df(df_terra_genesis, "Terra Genesis")
visualise_df(df_ultimex, "Ultimate X-Poker")
n_boots = 500
dist_bingo_aloha = descdist(df_bingo_aloha$total_wins_spend, boot = n_boots)
dist_homw = descdist(df_homw$total_wins_spend, boot = n_boots)
dist_idle_mafia = descdist(df_idle_mafia$total_wins_spend, boot = n_boots)
dist_spongebob = descdist(df_spongebob$total_wins_spend, boot = n_boots)
dist_terra_genesis = descdist(df_terra_genesis$total_wins_spend, boot = n_boots)
dist_ultimex = descdist(df_ultimex$total_wins_spend, boot = n_boots)
mat_out = matrix(data = c(descdist_to_list(dist_bingo_aloha),
descdist_to_list(dist_homw),
descdist_to_list(dist_idle_mafia),
descdist_to_list(dist_spongebob),
descdist_to_list(dist_terra_genesis),
descdist_to_list(dist_ultimex)),
nrow = 6, ncol = 7, byrow = T,
dimnames = list(c("Bingo Aloha",
"HOMW",
"Idle Mafia",
"Spongebob",
"Terra Genesis",
"Ultimate X-Poker"),
c("Min",
"Max",
"Median",
"Mean",
"SD",
"Skewness",
"Kurtosis")))
as.data.frame(mat_out)
distributions = c(rlnorm, rweibull, rexp)#, rtweedie)
distribution_name = c("lnorm", "weibull", "exp")#, "tweedie")
colours = c("plum", "darkgreen", "dodgerblue")#, "cyan")
fit_bingo_aloha = get_dist(df_bingo_aloha$total_wins_spend,
distributions, distribution_name, colours, T, F, F)
fit_homw = get_dist(df_homw$total_wins_spend/100,
distributions, distribution_name, colours, T, F, F)
fit_idle_mafia = get_dist(df_idle_mafia$total_wins_spend/100,
distributions, distribution_name, colours, T, F, F)
fit_spongebob = get_dist(df_spongebob$total_wins_spend,
distributions, distribution_name, colours, T, F, F)
fit_terra_genesis = get_dist(df_terra_genesis$total_wins_spend,
distributions, distribution_name, colours, T, F, F)
fit_ultimex = get_dist(df_ultimex$total_wins_spend,
distributions, distribution_name, colours, T, F, F)
mat_out2 = matrix(data = c(fit_bingo_aloha$model_used,
fit_homw$model_used,
fit_idle_mafia$model_used,
fit_spongebob$model_used,
fit_terra_genesis$model_used,
fit_ultimex$model_used),
nrow = 6, ncol = 1, byrow = T,
dimnames = list(c("Bingo Aloha",
"HOMW",
"Idle Mafia",
"Spongebob",
"Terra Genesis",
"Ultimate X-Poker"),
c("Model Used")))
as.data.frame(mat_out2)
All data sets select the log-normal as the best-fitting distribution.